
ENH: select_dypes impl #7434


Merged
merged 1 commit into from
Jul 7, 2014

Conversation

cpcloud
Member

@cpcloud cpcloud commented Jun 11, 2014

closes #7316

examples:

In [8]: paste
   df = DataFrame({'a': list('abc'),
                   'b': list(range(1, 4)),
                   'c': np.arange(3, 6).astype('u1'),
                   'd': np.arange(4.0, 7.0),
                   'e': [True, False, True],
                   'f': [False, True, False],
                   'g': pd.date_range('now', periods=3).values})
   df['h'] = df.g.diff()
   df['i'] = np.arange(3, 6).astype('u8')
   df['j'] = pd.date_range('20130101', periods=3).values
   df
## -- End pasted text --
Out[8]:
   a  b  c  d      e      f                   g      h  i          j
0  a  1  3  4   True  False 2014-06-22 23:40:53    NaT  3 2013-01-01
1  b  2  4  5  False   True 2014-06-23 23:40:53 1 days  4 2013-01-02
2  c  3  5  6   True  False 2014-06-24 23:40:53 1 days  5 2013-01-03

In [9]: paste
   df.select_type(include=[bool])
## -- End pasted text --
Out[9]:
       e      f
0   True  False
1  False   True
2   True  False

In [10]: paste
   df.select_type(include=['number', 'bool'], exclude=['unsignedinteger'])
## -- End pasted text --
Out[10]:
   b  d      e      f      h
0  1  4   True  False    NaT
1  2  5  False   True 1 days
2  3  6   True  False 1 days

In [11]: np.timedelta64.mro()  # this is an integer type
Out[11]:
[numpy.timedelta64,
 numpy.signedinteger,
 numpy.integer,
 numpy.number,
 numpy.generic,
 object]

In [13]: paste
   df.select_type(include=['object'])
## -- End pasted text --
Out[13]:
   a
0  a
1  b
2  c
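The timedelta column `h` is picked up by `'number'` in the example above because of where NumPy places `timedelta64` in its scalar hierarchy, while the datetime columns `g` and `j` are not. A minimal check of that, using only NumPy's class hierarchy (nothing here is taken from the PR's implementation):

```python
import numpy as np

# timedelta64 sits under signedinteger -> integer -> number.
timedelta_is_number = issubclass(np.timedelta64, np.number)

# datetime64 derives directly from np.generic, so it is not a "number".
datetime_is_number = issubclass(np.datetime64, np.number)

# bool_ is also outside the numeric branch, which is why 'bool' has to be
# listed separately in include=['number', 'bool'].
bool_is_number = issubclass(np.bool_, np.number)

print(timedelta_is_number, datetime_is_number, bool_is_number)
```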

@cpcloud cpcloud added this to the 0.14.1 milestone Jun 11, 2014
@cpcloud cpcloud self-assigned this Jun 11, 2014
@jreback
Contributor

jreback commented Jun 11, 2014

should not allow ndim=1, and your indexer for ndim>2 is not right (it's a bit tricky with Panels, since which dtype you'd be getting depends on the orientation), so I would just disallow anything but ndim==2 for now

@cpcloud
Member Author

cpcloud commented Jun 11, 2014

Yep that's the wip part :)

@cpcloud
Member Author

cpcloud commented Jun 11, 2014

I'll just implement on frame directly

@jreback
Contributor

jreback commented Jun 11, 2014

This is a subtle issue. Specifying float should match on the kind of the type rather than with `==`:

In [1]: np.dtype('float') == 'float32'
Out[1]: False

In [2]: np.dtype('float') == 'float64'
Out[2]: True

In [3]: np.dtype('float') == 'float'
Out[3]: True

need some tests with string dtypes too (and for invalid ones I think you should raise), e.g. datetime64[D] should raise, but datetime and datetime64 I think are ok (just because the first is TOO specific)

@cpcloud
Member Author

cpcloud commented Jun 11, 2014

I check whether the dtype.type attribute is a subclass of the passed-in dtype. All dtypes are converted using np.dtype before any checks occur
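A rough sketch of the subclass-based check described here, set against the `==` comparison from the previous comment. The helper name `dtype_matches` is hypothetical and only illustrates the idea; it is not the code in this PR:

```python
import numpy as np


def dtype_matches(column_dtype, requested_class):
    """Hypothetical helper: True if the column's scalar type derives from
    the requested (possibly abstract) NumPy scalar class."""
    return issubclass(np.dtype(column_dtype).type, requested_class)


# Equality only matches one concrete width ...
eq_float32 = np.dtype("float") == np.dtype("float32")

# ... while the subclass check treats every float width as "floating".
sub_float32 = dtype_matches("float32", np.floating)
sub_float64 = dtype_matches("float64", np.floating)
sub_int64 = dtype_matches("int64", np.floating)

print(eq_float32, sub_float32, sub_float64, sub_int64)
```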

@cpcloud
Member Author

cpcloud commented Jun 11, 2014

whew jeez there are all sorts of dtype special cases

@cpcloud
Member Author

cpcloud commented Jun 11, 2014

@jreback i think 'integer' and 'int' should mean different things to be consistent with numpy; i'm also not really too keen on 'numeric' since that's not a dtype, but i'm allowing 'number'
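To illustrate the distinction being drawn, here is a small NumPy-only sketch: `np.integer` is the abstract parent of all integer widths, while `np.dtype('int')` resolves to one concrete, platform-sized signed type. How the final `select_dtypes` maps the strings `'integer'` and `'int'` is not shown here; this only demonstrates the NumPy behaviour the comment refers to.

```python
import numpy as np

# np.integer is the abstract parent of every signed and unsigned integer width.
abstract_matches_uint8 = issubclass(np.uint8, np.integer)
abstract_matches_int16 = issubclass(np.int16, np.integer)

# np.dtype('int') resolves to a single concrete signed integer type
# (its width depends on the platform and NumPy version).
concrete = np.dtype("int").type
concrete_matches_uint8 = issubclass(np.uint8, concrete)
concrete_matches_int16 = issubclass(np.int16, concrete)

print(abstract_matches_uint8, abstract_matches_int16)
print(concrete_matches_uint8, concrete_matches_int16)
```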

@cpcloud
Member Author

cpcloud commented Jun 11, 2014

not sure i see the problem with datetime64[D] even tho we don't really allow it


# empty include/exclude -> defaults to True
wanted = pd.Series(dict.fromkeys(index, not bool(include)))
not_wanted = pd.Series(dict.fromkeys(index, not bool(exclude)))
Contributor

FYI you can just do pd.Series(not bool(exclude), index). (Semantically this one seems strange, I think this is actually not_not_wanted?? :s )

Member Author

yeah the implementation is a bit unreadable right now ... i'm working on it

Contributor

No worries. This'll be neat!
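The scalar-broadcast form suggested in this thread can be checked against the `dict.fromkeys` form with a quick sketch. The `index` and `exclude` values below are stand-ins chosen for illustration, and the labels are assumed to be unique (a dict would collapse duplicates):

```python
import pandas as pd

index = ["a", "b", "c"]  # stand-in for the frame's column labels (assumed unique)
exclude = []             # an empty exclude list, so `not bool(exclude)` is True

# Form used in the diff: build a dict with one key per label.
via_dict = pd.Series(dict.fromkeys(index, not bool(exclude)))

# Suggested form: broadcast a scalar over the index.
via_scalar = pd.Series(not bool(exclude), index=index)

same = via_dict.equals(via_scalar)
print(same)
```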

@jreback
Contributor

jreback commented Jun 22, 2014

how's this coming?

@cpcloud
Member Author

cpcloud commented Jun 22, 2014

Will get to it today. Work has picked up in the last week or so, haven't had as much time as I would like for pandas. :(

@cpcloud
Member Author

cpcloud commented Jun 22, 2014

@jreback should empty include and empty exclude return an empty frame?

@cpcloud
Member Author

cpcloud commented Jun 22, 2014

or how about a raise since that doesn't really make any sense

@cpcloud
Member Author

cpcloud commented Jun 22, 2014

oh duh i'm already doing that nevermind

@jreback
Contributor

jreback commented Jun 22, 2014

yeh I think you have to have either include or exclude non empty / not none

still could return empty frame of course
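For reference, a short sketch of how this plays out with `DataFrame.select_dtypes` as released in pandas: calling it with neither `include` nor `exclude` raises a `ValueError`. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"b": [1, 2, 3],
                   "d": [4.0, 5.0, 6.0],
                   "e": [True, False, True]})

# Neither include nor exclude given: pandas refuses rather than guessing.
try:
    df.select_dtypes()
    raised = False
except ValueError:
    raised = True

# With at least one of them non-empty, columns are filtered by dtype.
bool_cols = list(df.select_dtypes(include=[bool]).columns)
number_cols = list(df.select_dtypes(include=[np.number]).columns)

print(raised, bool_cols, number_cols)
```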

@cpcloud
Member Author

cpcloud commented Jun 22, 2014

now disallowing datetime64[D] and others like it, i agree with you @jreback that it doesn't make sense ... allowing ns tho, since that's our main representation

@jreback
Contributor

jreback commented Jun 22, 2014

you could allow things like

'datetime', 'date time', datetime.datetime, np.datetime

eg generic datetime dtypes (you may need to have a dict/regex matcher for things like this)

@cpcloud
Member Author

cpcloud commented Jun 22, 2014

i'm already allowing 'datetime', 'datetime64' and 'datetime64[ns]', in addition to the dtype form of the last two... i think that's enough.

@jreback
Contributor

jreback commented Jun 22, 2014

ok cool

@cpcloud
Member Author

cpcloud commented Jun 22, 2014

the one thing that's currently not possible is getting columns that are non string object dtypes

@cpcloud
Member Author

cpcloud commented Jun 22, 2014

ie if you ask for object you'll always get strings as well as any other objects

@jreback
Contributor

jreback commented Jun 22, 2014

no distinction between string and those anyhow (there could be but not now)

@@ -1149,11 +1143,97 @@ default value.
s.get('a') # equivalent to s['a']
s.get('x', default=-1)

Selecting columns based on ``dtype``
Contributor

I think this should be here: (the whole section): http://pandas-docs.github.io/pandas-docs-travis/basics.html#dtypes

Member Author

ok will move

@jreback jreback changed the title WIP: select_type impl WIP: select_dypes impl Jun 26, 2014
@cpcloud cpcloud changed the title WIP: select_dypes impl ENH: select_dypes impl Jun 27, 2014
@cpcloud
Member Author

cpcloud commented Jun 27, 2014

@jreback @jorisvandenbossche any more comments?

@@ -1603,6 +1603,66 @@ def _get_fill_func(method):
#----------------------------------------------------------------------
# Lots of little utilities

def _validate_date_like_dtype(dtype):
    try:
        typ = np.datetime_data(dtype)[0]
Contributor

never knew there was a function like this!

Member Author

yeah, only thing is that the second tuple element is num, which is used only internally in numpy AFAICT; it seems to always be 1 when used in python, but i think it's used to do conversions between different units.

Contributor

you can also just test the kind as M, right?

Member Author

no i'm checking that the unit is either generic (this occurs when a user passes in np.datetime64; 'datetime64[generic]' shouldn't really be passed in by a user, tho it will work) or ns, i could write something simple that parses out the unit for us, but not sure that's necessary
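A small sketch of what `np.datetime_data` returns for the cases discussed in this thread. The helper `unit_is_generic_or_ns` is hypothetical and only mirrors the check as described; it is not the body of `_validate_date_like_dtype`. Note that the second tuple element is a multiplier on the unit, so it can differ from 1 for dtypes such as `timedelta64[25s]`.

```python
import numpy as np


def unit_is_generic_or_ns(dtype):
    """Hypothetical helper: accept only unit-less or nanosecond date-like dtypes."""
    unit, _count = np.datetime_data(np.dtype(dtype))
    return unit in ("generic", "ns")


generic_meta = np.datetime_data(np.dtype("datetime64"))      # no unit given
ns_meta = np.datetime_data(np.dtype("datetime64[ns]"))
day_meta = np.datetime_data(np.dtype("datetime64[D]"))
multi_meta = np.datetime_data(np.dtype("timedelta64[25s]"))  # count is 25 here

print(generic_meta, ns_meta, day_meta, multi_meta)
print(unit_is_generic_or_ns("datetime64[ns]"), unit_is_generic_or_ns("datetime64[D]"))
```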

@jreback
Contributor

jreback commented Jul 1, 2014

I think you just need a couple of doc edits (from above).....

@cpcloud
Member Author

cpcloud commented Jul 2, 2014

Yeah sorry things just picking up a lot at work. Will do this today, so we can get the ball rolling on the release.

@cpcloud
Member Author

cpcloud commented Jul 6, 2014

@jreback good 2 go?

@jreback
Contributor

jreback commented Jul 6, 2014

do you test if df.select_dtypes(['foo']) raises? e.g. a non-dtype (or do you just ignore this?) I think should raise
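One way such a check could look, sketched with NumPy only. The function `resolve_dtype_name` is hypothetical and not taken from this PR: it tries `np.dtype` first, falls back to abstract scalar classes such as `np.number` looked up by name, and raises `TypeError` for anything else.

```python
import numpy as np


def resolve_dtype_name(name):
    """Hypothetical helper: map a string to a NumPy scalar class or raise."""
    try:
        # Concrete names such as 'float32' or 'int64'.
        return np.dtype(name).type
    except TypeError:
        # Abstract names such as 'number' are classes on the numpy module.
        candidate = getattr(np, name, None)
        if isinstance(candidate, type) and issubclass(candidate, np.generic):
            return candidate
        raise TypeError("%r is not a recognized dtype" % (name,))


float32_class = resolve_dtype_name("float32")
number_class = resolve_dtype_name("number")

try:
    resolve_dtype_name("foo")
    foo_raised = False
except TypeError:
    foo_raised = True

print(float32_class, number_class, foo_raised)
```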

@cpcloud
Member Author

cpcloud commented Jul 6, 2014

good call, don't think i test this

@jreback
Contributor

jreback commented Jul 7, 2014

@cpcloud when you have a chance

@cpcloud
Member Author

cpcloud commented Jul 7, 2014

@jreback this is good to go i think




Working with package options
Contributor

is this supposed to be added?

Member Author

crap totally missed that, thanks

Contributor

maybe you are not rebased to master?

Contributor

as soon as you are ready then go ahead and merge (don't worry about travis). it already passed :)

@jreback
Contributor

jreback commented Jul 7, 2014

I think this section slipped back in somehow: http://pandas-docs.github.io/pandas-docs-travis/options.html (as the new options section is there)

@jreback
Contributor

jreback commented Jul 7, 2014

hmm. go ahead and merge this. let's fix in another commit (removing the options stuff from basics) https://github.com/pydata/pandas/pull/7578/files

@jreback
Contributor

jreback commented Jul 7, 2014

never mind, master looks fine

@cpcloud
Member Author

cpcloud commented Jul 7, 2014

cool merging

cpcloud added a commit that referenced this pull request Jul 7, 2014
@cpcloud cpcloud merged commit bcbc7af into pandas-dev:master Jul 7, 2014
@cpcloud cpcloud deleted the with-types branch July 7, 2014 16:22
@jreback
Contributor

jreback commented Jul 7, 2014

nice !

@jreback
Contributor

jreback commented Jul 7, 2014

I think you can add a mini example in 0.14.1 for select_dtypes (can be same as from docs).....

@jreback
Contributor

jreback commented Jul 7, 2014

you can now update SO with your docs link: http://pandas-docs.github.io/pandas-docs-travis/basics.html#basics-selectdtypes for the gazillion questions that always happen about this :)

@jankatins
Contributor

This needs to be tested with dtype "category" #7217

@jreback
Contributor

jreback commented Jul 9, 2014

jreback@a43c6c0

makes this work

Labels
API Design, Dtype Conversions (unexpected or buggy dtype conversions), Enhancement
Development

Successfully merging this pull request may close these issues.

API: select_dtypes
5 participants